To study climate change, we use data on the Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies in the Northern Hemisphere from NASA's Goddard Institute for Space Studies. The tabular data of temperature anomalies can be found here.
Temperature anomalies are defined relative to a reference (base) period, which NASA specifies as 1951-1980.
We load the data with the following code (given by Prof. Kostis):
weather <-
read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v4/NH.Ts+dSST.csv",
skip = 1,
na = "***")

The two objectives in this section are to:

Select the year and the twelve month variables from the weather dataset; we do not need the others (J-D, D-N, DJF, etc.) for this assignment.
Convert the dataframe from wide to long format, and use the new dataframe, named tidyweather, to analyse the anomalies, which are stored in the variable delta.
weather_update <- weather %>%
select(1:13)
head(weather_update)

## # A tibble: 6 × 13
## Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1880 -0.34 -0.5 -0.22 -0.29 -0.05 -0.15 -0.17 -0.25 -0.22 -0.31 -0.42 -0.39
## 2 1881 -0.3 -0.21 -0.03 0.01 0.04 -0.32 0.09 -0.03 -0.25 -0.42 -0.36 -0.22
## 3 1882 0.27 0.22 0.02 -0.31 -0.24 -0.29 -0.27 -0.14 -0.24 -0.52 -0.32 -0.68
## 4 1883 -0.57 -0.65 -0.15 -0.3 -0.25 -0.11 -0.05 -0.22 -0.33 -0.16 -0.44 -0.14
## 5 1884 -0.16 -0.1 -0.64 -0.59 -0.35 -0.41 -0.44 -0.5 -0.44 -0.44 -0.57 -0.46
## 6 1885 -1 -0.44 -0.23 -0.48 -0.58 -0.44 -0.34 -0.41 -0.4 -0.37 -0.38 -0.11
tidyweather <- weather_update %>%
pivot_longer(cols=2:13,
names_to="Month",
values_to = "delta")
head(tidyweather)

## # A tibble: 6 × 3
## Year Month delta
## <dbl> <chr> <dbl>
## 1 1880 Jan -0.34
## 2 1880 Feb -0.5
## 3 1880 Mar -0.22
## 4 1880 Apr -0.29
## 5 1880 May -0.05
## 6 1880 Jun -0.15
Thus the dataframe now has three variables: Year, Month, and delta.
To plot the data in a time-series scatter plot and add a trendline, we first need to create a new variable called date to ensure that the delta values are plotted chronologically.
tidyweather <- tidyweather %>%
mutate(date = ymd(paste(as.character(Year), Month, "1")),
month = month(date, label=TRUE),
year = year(date))
ggplot(tidyweather, aes(x=date, y = delta))+
geom_point()+
geom_smooth(color="red") +
theme_bw() +
labs(
title = "Weather Anomalies"
)

Analysing this plot suggests that the temperature increase is more pronounced in some months than in others; specifically, February, March, November and December. The graph shows these months reaching the 1.5-2 delta range, which tells us that their temperatures were especially abnormal, since most months fall in the 1-1.5 delta range.
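Rather than judging this from the plot alone, the claim can be checked numerically by ranking months by their mean anomaly over a recent window. A minimal sketch, using a toy data frame in place of tidyweather (the real values would come from the data loaded above):

```r
library(dplyr)

# Toy stand-in for tidyweather, restricted to two months for illustration
toy <- data.frame(
  year  = rep(2015:2020, each = 2),
  month = rep(c("Feb", "Jul"), times = 6),
  delta = c(1.9, 1.1, 1.8, 1.2, 2.0, 1.0, 1.7, 1.3, 1.9, 1.1, 1.8, 1.2)
)

# Rank months by their mean anomaly: the "warmest" months come out on top
monthly_means <- toy %>%
  group_by(month) %>%
  summarise(mean_delta = mean(delta), .groups = "drop") %>%
  arrange(desc(mean_delta))

monthly_means
```

Applying the same group_by()/summarise() to tidyweather, filtered to recent years, would show which months genuinely sit in the 1.5-2 range.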
To produce more useful plots, we will group the years into intervals and filter out the years prior to 1881.
comparison <- tidyweather %>%
filter(Year>= 1881) %>% #remove years prior to 1881
#create new variable 'interval', and assign values based on criteria below:
mutate(interval = case_when(
Year %in% c(1881:1920) ~ "1881-1920",
Year %in% c(1921:1950) ~ "1921-1950",
Year %in% c(1951:1980) ~ "1951-1980",
Year %in% c(1981:2010) ~ "1981-2010",
TRUE ~ "2011-present"
))
head(comparison)

## # A tibble: 6 × 7
## Year Month delta date month year interval
## <dbl> <chr> <dbl> <date> <ord> <dbl> <chr>
## 1 1881 Jan -0.3 1881-01-01 Jan 1881 1881-1920
## 2 1881 Feb -0.21 1881-02-01 Feb 1881 1881-1920
## 3 1881 Mar -0.03 1881-03-01 Mar 1881 1881-1920
## 4 1881 Apr 0.01 1881-04-01 Apr 1881 1881-1920
## 5 1881 May 0.04 1881-05-01 May 1881 1881-1920
## 6 1881 Jun -0.32 1881-06-01 Jun 1881 1881-1920
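An equivalent way to build the interval variable is base R's cut(), which assigns each year to a bin in one call; a sketch with breaks chosen to match the case_when() above (right-closed, so 1920 lands in "1881-1920"):

```r
# cut() with right-closed breaks reproduces the case_when() labelling above
years <- c(1881, 1920, 1921, 1950, 1981, 2015)
interval <- cut(years,
                breaks = c(1880, 1920, 1950, 1980, 2010, Inf),
                labels = c("1881-1920", "1921-1950", "1951-1980",
                           "1981-2010", "2011-present"))
interval
```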
We can now employ this to create a density plot:
ggplot(comparison, aes(x=delta, fill=interval))+
geom_density(alpha=0.2) + #density plot with transparency set to 20%
theme_bw() + #theme
labs (
title = "Density Plot for Monthly Temperature Anomalies",
y = "Density" #changing y-axis label to sentence case
)

Continuing the analysis, we will look at annual anomalies in the data, using the following code:
#creating yearly averages
average_annual_anomaly <- tidyweather %>%
group_by(Year) %>% #grouping data by Year
# creating summaries for mean delta
# use `na.rm=TRUE` to eliminate NA (not available) values
summarise(annual_average_delta= mean(delta, na.rm=TRUE ))
head(average_annual_anomaly)

## # A tibble: 6 × 2
## Year annual_average_delta
## <dbl> <dbl>
## 1 1880 -0.276
## 2 1881 -0.167
## 3 1882 -0.208
## 4 1883 -0.281
## 5 1884 -0.425
## 6 1885 -0.432
#plotting the data:
ggplot(average_annual_anomaly, aes(x=Year, y= annual_average_delta))+
geom_point()+
#Fit the best fit line, using LOESS method
geom_smooth() +
#change to theme_bw() to have white background + black frame around plot
theme_bw() +
labs (
title = "Average Yearly Anomaly",
y = "Average Annual Delta"
)

NASA points out on their website that:
A one-degree global change is significant because it takes a vast amount of heat to warm all the oceans, atmosphere, and land by that much. In the past, a one- to two-degree drop was all it took to plunge the Earth into the Little Ice Age.
We will thus construct a confidence interval (using both the formula and bootstrapping) for the average annual delta from 2011:
formula_ci <- comparison %>%
# choose the interval 2011-present
filter(interval == "2011-present") %>%
summarise(mean_delta = mean(delta, na.rm = TRUE),
median_delta = median(delta, na.rm = TRUE),
sd_delta = sd(delta, na.rm = TRUE),
count = n(),
t_critical = qt(0.975, count - 1),
se_delta = sd_delta/sqrt(count),
margin_of_error = t_critical * se_delta,
delta_low = mean_delta - margin_of_error,
delta_high = mean_delta + margin_of_error)
formula_ci

## # A tibble: 1 × 9
## mean_delta median_delta sd_delta count t_critical se_delta margin_of_error
## <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1.06 1.04 0.274 132 1.98 0.0239 0.0473
## # … with 2 more variables: delta_low <dbl>, delta_high <dbl>
library(infer)
set.seed(1234)
boot_delta <- comparison %>%
filter(interval == "2011-present") %>%
specify(response = delta) %>%
generate(reps = 1000, type = "bootstrap")%>%
calculate(stat = "mean")
percentile_ci <- boot_delta %>%
get_confidence_interval (level = 0.95, type = "percentile")
percentile_ci

## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 1.02 1.11
The data show that the average annual delta in temperature for the 2011-present interval is about 1.06. We obtained this by filtering the original dataframe to the 2011-present interval and using summarise() to compute the mean, median, and related statistics. We then cross-checked the estimate by bootstrapping: after loading the infer package and calling set.seed() so that the random draws are reproducible, we filtered again to 2011-present, specified delta as the response variable, generated 1,000 bootstrap samples, and calculated the mean of each. The resulting 95% percentile interval (1.02 to 1.11) closely matches the formula-based interval, so we are satisfied that the result is robust.
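As a further sanity check, base R's t.test() applies exactly the same t-distribution formula used in formula_ci, so the two intervals should agree to machine precision. A sketch on simulated deltas (not the real GISTEMP values):

```r
# Simulated anomalies with roughly the observed mean and spread
set.seed(42)
toy_delta <- rnorm(132, mean = 1.06, sd = 0.27)

# Manual t-based interval, as in formula_ci
n <- length(toy_delta)
se <- sd(toy_delta) / sqrt(n)
manual_low  <- mean(toy_delta) - qt(0.975, n - 1) * se
manual_high <- mean(toy_delta) + qt(0.975, n - 1) * se

# t.test() should reproduce the same bounds
tt <- t.test(toy_delta)
c(manual_low, manual_high)
tt$conf.int
```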
A 2010 Pew Research poll asked 1,306 Americans, “From what you’ve read and heard, is there solid evidence that the average temperature on earth has been getting warmer over the past few decades, or not?”
In this section we analyse whether there are any differences between the proportion of people who believe the earth is getting warmer and their political ideology. As usual, from the survey sample data, we will use the proportions to estimate values of population parameters. The file has 2253 observations on the following 2 variables:
party_or_ideology: a factor (categorical) variable with levels Conservative Republican, Liberal Democrat, Mod/Cons Democrat, Mod/Lib Republican
response: whether the respondent believes the earth is warming or not, or Don't know / refuse to answer

global_warming_pew <- read_csv(here::here("data", "global_warming_pew.csv"))

You will also notice that many responses should not be taken into consideration, like "No Answer", "Don't Know", "Not applicable", "Refused to Answer".
global_warming_pew %>%
count(party_or_ideology, response)

## # A tibble: 12 × 3
## party_or_ideology response n
## <chr> <chr> <int>
## 1 Conservative Republican Don't know / refuse to answer 45
## 2 Conservative Republican Earth is warming 248
## 3 Conservative Republican Not warming 450
## 4 Liberal Democrat Don't know / refuse to answer 23
## 5 Liberal Democrat Earth is warming 405
## 6 Liberal Democrat Not warming 23
## 7 Mod/Cons Democrat Don't know / refuse to answer 45
## 8 Mod/Cons Democrat Earth is warming 563
## 9 Mod/Cons Democrat Not warming 158
## 10 Mod/Lib Republican Don't know / refuse to answer 23
## 11 Mod/Lib Republican Earth is warming 135
## 12 Mod/Lib Republican Not warming 135
We will construct four 95% confidence intervals to estimate the population proportion who believe that the Earth is warming within each party or ideology, excluding the "Don't know / refuse to answer" respondents.
updated_global_warming_pew <- global_warming_pew %>%
count(party_or_ideology, response) %>%
filter(response != "Don't know / refuse to answer") %>%
pivot_wider(names_from = response,
values_from = n ) %>%
janitor::clean_names() %>%
mutate( total = earth_is_warming + not_warming ,
proportion= earth_is_warming / total,
se = sqrt(proportion * (1-proportion)/ total),
lower = proportion - 1.96*se,
upper = proportion + 1.96*se
)
#updated_global_warming_pew
print.data.frame(updated_global_warming_pew)

## party_or_ideology earth_is_warming not_warming total proportion se
## 1 Conservative Republican 248 450 698 0.355 0.0181
## 2 Liberal Democrat 405 23 428 0.946 0.0109
## 3 Mod/Cons Democrat 563 158 721 0.781 0.0154
## 4 Mod/Lib Republican 135 135 270 0.500 0.0304
## lower upper
## 1 0.320 0.391
## 2 0.925 0.968
## 3 0.751 0.811
## 4 0.440 0.560
In line with The challenging politics of climate change, the respondents' ideology is not independent of their beliefs about the earth warming. Republicans (both conservative and moderate/liberal) are less likely to believe that the earth is warming, and Conservative Republicans are the least likely of all, with only 35.5% considering global warming a real issue.
Conversely, Democrats are more likely to believe that the earth is warming, with Liberal Democrats having the highest percentage (94.6%).
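The independence claim can also be tested formally with a chi-square test of independence on the counts shown above (excluding the Don't know / refuse to answer rows); a minimal sketch:

```r
# Warming vs not-warming counts per ideology, taken from the table above
counts <- matrix(c(248, 450,   # Conservative Republican
                   405,  23,   # Liberal Democrat
                   563, 158,   # Mod/Cons Democrat
                   135, 135),  # Mod/Lib Republican
                 ncol = 2, byrow = TRUE,
                 dimnames = list(NULL, c("warming", "not_warming")))

# A tiny p-value means ideology and belief are not independent
warming_test <- chisq.test(counts)
warming_test
```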
As we saw in class, fivethirtyeight.com has detailed data on all polls that track the president's approval.
# Import approval polls data directly off fivethirtyeight website
approval_polllist <- read_csv('https://projects.fivethirtyeight.com/biden-approval-data/approval_polllist.csv')
glimpse(approval_polllist)

## Rows: 1,819
## Columns: 22
## $ president <chr> "Joseph R. Biden Jr.", "Joseph R. Biden Jr.", "Jos…
## $ subgroup <chr> "All polls", "All polls", "All polls", "All polls"…
## $ modeldate <chr> "10/1/2021", "10/1/2021", "10/1/2021", "10/1/2021"…
## $ startdate <chr> "1/19/2021", "1/19/2021", "1/20/2021", "1/20/2021"…
## $ enddate <chr> "1/21/2021", "1/21/2021", "1/21/2021", "1/21/2021"…
## $ pollster <chr> "Rasmussen Reports/Pulse Opinion Research", "Morni…
## $ grade <chr> "B", "B", "B+", "B", "B-", "B", "B+", "B-", "B", "…
## $ samplesize <dbl> 1500, 15000, 1516, 1993, 1115, 15000, 941, 1200, 1…
## $ population <chr> "lv", "a", "a", "rv", "a", "a", "rv", "rv", "lv", …
## $ weight <dbl> 0.3382, 0.2594, 1.2454, 0.0930, 1.1014, 0.2333, 1.…
## $ influence <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ approve <dbl> 48, 50, 45, 56, 55, 51, 63, 58, 48, 52, 53, 55, 48…
## $ disapprove <dbl> 45, 28, 28, 31, 32, 28, 37, 32, 47, 29, 29, 33, 47…
## $ adjusted_approve <dbl> 50.5, 48.6, 46.5, 54.6, 53.9, 49.6, 58.7, 56.9, 50…
## $ adjusted_disapprove <dbl> 38.8, 31.3, 28.4, 34.3, 32.9, 31.3, 38.0, 33.1, 40…
## $ multiversions <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ tracking <lgl> TRUE, TRUE, NA, NA, NA, TRUE, NA, NA, TRUE, TRUE, …
## $ url <chr> "https://www.rasmussenreports.com/public_content/p…
## $ poll_id <dbl> 74247, 74272, 74327, 74246, 74248, 74273, 74256, 7…
## $ question_id <dbl> 139395, 139491, 139570, 139394, 139404, 139492, 13…
## $ createddate <chr> "1/22/2021", "1/28/2021", "2/2/2021", "1/22/2021",…
## $ timestamp <chr> "10:23:09 1 Oct 2021", "10:23:09 1 Oct 2021", "1…
# Use `lubridate` to fix dates, as they are given as characters.
library(lubridate)
updated_approval_polllist <- approval_polllist %>%
mutate(modeldate = mdy(modeldate),
startdate = mdy(startdate),
enddate = mdy(enddate),
createddate = mdy(createddate),
week = isoweek(enddate))
updated_approval_polllist

## # A tibble: 1,819 × 23
## president subgroup modeldate startdate enddate pollster grade samplesize
## <chr> <chr> <date> <date> <date> <chr> <chr> <dbl>
## 1 Joseph R… All pol… 2021-10-01 2021-01-19 2021-01-21 Rasmuss… B 1500
## 2 Joseph R… All pol… 2021-10-01 2021-01-19 2021-01-21 Morning… B 15000
## 3 Joseph R… All pol… 2021-10-01 2021-01-20 2021-01-21 YouGov B+ 1516
## 4 Joseph R… All pol… 2021-10-01 2021-01-20 2021-01-21 Morning… B 1993
## 5 Joseph R… All pol… 2021-10-01 2021-01-20 2021-01-21 Ipsos B- 1115
## 6 Joseph R… All pol… 2021-10-01 2021-01-20 2021-01-22 Morning… B 15000
## 7 Joseph R… All pol… 2021-10-01 2021-01-21 2021-01-22 HarrisX B+ 941
## 8 Joseph R… All pol… 2021-10-01 2021-01-21 2021-01-23 RMG Res… B- 1200
## 9 Joseph R… All pol… 2021-10-01 2021-01-20 2021-01-24 Rasmuss… B 1500
## 10 Joseph R… All pol… 2021-10-01 2021-01-21 2021-01-23 Morning… B 15000
## # … with 1,809 more rows, and 15 more variables: population <chr>,
## # weight <dbl>, influence <dbl>, approve <dbl>, disapprove <dbl>,
## # adjusted_approve <dbl>, adjusted_disapprove <dbl>, multiversions <chr>,
## # tracking <lgl>, url <chr>, poll_id <dbl>, question_id <dbl>,
## # createddate <date>, timestamp <chr>, week <dbl>
We will now create a plot of Biden’s net approval ratings since entering office, using the following plot as a reference:
new_approval_polllist <- updated_approval_polllist %>%
mutate(net_approval_rate = approve - disapprove) %>%
group_by(week) %>%
summarise(average_net_approval_rate = mean(net_approval_rate),
median_rate = median(net_approval_rate),
sd_rate = sd(net_approval_rate),
count = n(),
t_critical = qt(0.975, count - 1),
se_rate = sd_rate/sqrt(count),
margin_of_error = t_critical * se_rate,
CI_rate_low = average_net_approval_rate - margin_of_error,
CI_rate_high = average_net_approval_rate + margin_of_error)
head(new_approval_polllist)

## # A tibble: 6 × 10
## week average_net_approval_rate median_rate sd_rate count t_critical se_rate
## <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 3 20.2 23 7.90 28 2.05 1.49
## 2 4 18.1 22 8.58 51 2.01 1.20
## 3 5 16.2 20 7.82 49 2.01 1.12
## 4 6 16.7 20.5 7.43 40 2.02 1.18
## 5 7 16.1 19 7.48 52 2.01 1.04
## 6 8 13.8 15.5 7.46 54 2.01 1.02
## # … with 3 more variables: margin_of_error <dbl>, CI_rate_low <dbl>,
## # CI_rate_high <dbl>
ggplot(new_approval_polllist, aes(x = week, y = average_net_approval_rate)) +
geom_point(color = "orange", size = 0.6) +
geom_smooth(color = "blue", size = 0.6, se = FALSE) +
geom_line(color = "orange", size = 0.3) +
geom_line(aes(y = CI_rate_high), color = "orange", size = 0.3) +
geom_line(aes(y = CI_rate_low), color = "orange", size = 0.3) +
geom_line(aes(y = 0), color = "orange", size = 1) +
geom_ribbon(aes(ymin = CI_rate_low, ymax = CI_rate_high), alpha = 1/10) +
theme_bw() +
labs( y = "Average Approval Margin",
x = "Week of the Year",
title = "Estimating Approval Margin for Joe Biden",
subtitle = "Weekly average of all polls")

We were told by Prof. Kostis to leave this question out due to missing data.
week_9 <- new_approval_polllist %>%
filter(week == 9) %>%
select(week, CI_rate_low, CI_rate_high)
week_9

## # A tibble: 1 × 3
## week CI_rate_low CI_rate_high
## <dbl> <dbl> <dbl>
## 1 9 10.7 15.0

week_31 <- new_approval_polllist %>%
filter(week == 31) %>%
select(week, CI_rate_low, CI_rate_high)
week_31

## # A tibble: 1 × 3
## week CI_rate_low CI_rate_high
## <dbl> <dbl> <dbl>
## 1 31 4.97 8.09
Recall the TfL data on how many bikes were hired every single day. We can get the latest data by running the following:
url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"
# Download TFL data to temporary file
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))

## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2021-09-23T12%3A52%3A20/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20211002%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20211002T184231Z&X-Amz-Expires=300&X-Amz-Signature=7cdc644569fed5c595cc2883782e6079e173ea3375d978ab9a55c59fff9de48d&X-Amz-SignedHeaders=host]
## Date: 2021-10-02 18:42
## Status: 200
## Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
## Size: 174 kB
## <ON DISK> /var/folders/1t/0y34vl815b9fhdq__tk33y100000gn/T//RtmpifnPBS/file1a711a0014cf.xlsx
# Use read_excel to read it as dataframe
bike0 <- read_excel(bike.temp,
sheet = "Data",
range = cell_cols("A:B"))
# change dates to get year, month, and week
bike <- bike0 %>%
clean_names() %>%
rename (bikes_hired = number_of_bicycle_hires) %>%
mutate (year = year(day),
month = lubridate::month(day, label = TRUE),
week = isoweek(day))

We can easily create a facet grid that plots bikes hired by month and year.
Considering the May and June data, the standard deviation of the 2020 data appears higher than in other years, because the 2020 May and June curves are flatter. In every year except 2020, around 40k bikes were rented on average in May and June; in 2020, however, the data are not concentrated around the mean. The reason is most likely Covid-19: the pandemic disrupted rentals so much that we probably do not have enough data to draw meaningful statistics for May and June 2020, which is why the standard deviation for those months appears higher.
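The claim about 2020's higher spread could be checked directly by computing the standard deviation of daily hires per year over the May-June window. A sketch on toy numbers (the real calculation would use the bike data frame built above):

```r
library(dplyr)

# Toy May data: 2019 clustered near 40k, 2020 widely dispersed
toy_bike <- data.frame(
  year        = rep(c(2019, 2020), each = 4),
  bikes_hired = c(39000, 40000, 41000, 40000,
                  15000, 52000, 23000, 61000)
)

sd_by_year <- toy_bike %>%
  group_by(year) %>%
  summarise(sd_hired = sd(bikes_hired), .groups = "drop")

sd_by_year
```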
We will now move on to reproducing the following two graphs.
The second one looks at percentage changes from the expected level of weekly rentals. The two grey shaded rectangles correspond to Q2 (weeks 14-26) and Q4 (weeks 40-52).
For both of these graphs, we calculate the expected number of rentals per week or month between 2016-2019 and then see how each week or month of 2020-2021 compares to the expected rentals, using the calculation excess_rentals = actual_rentals - expected_rentals.
The mean should be used to calculate the expected rentals per month.
The number of rentals on weekdays and weekends is likely to vary a lot, and the mean takes every value in the week into account, giving a more accurate picture of the average number of rentals in the overall month. The median, by contrast, only takes the middle value, which is likely to be a weekday value since a week has five weekdays and two weekend days. The median therefore does not reflect average rentals per month accurately.
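A quick numerical illustration of this point: with five weekday values and two much higher weekend values, the median lands on a weekday and ignores the weekend entirely, while the mean reflects all seven days. The figures below are hypothetical.

```r
# One hypothetical week of rentals: weekdays around 31k, weekends much higher
week_rentals <- c(Mon = 30000, Tue = 31000, Wed = 30500, Thu = 31500,
                  Fri = 32000, Sat = 45000, Sun = 44000)

mean(week_rentals)    # pulled up by the two weekend days
median(week_rentals)  # 31500, a weekday value: weekends have no influence
```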
head(bike)

## # A tibble: 6 × 5
## day bikes_hired year month week
## <dttm> <dbl> <dbl> <ord> <dbl>
## 1 2010-07-30 00:00:00 6897 2010 Jul 30
## 2 2010-07-31 00:00:00 5564 2010 Jul 30
## 3 2010-08-01 00:00:00 4303 2010 Aug 30
## 4 2010-08-02 00:00:00 6642 2010 Aug 31
## 5 2010-08-03 00:00:00 7966 2010 Aug 31
## 6 2010-08-04 00:00:00 7893 2010 Aug 31
updated_bike <- bike %>%
filter (year >2015) %>%
group_by(month, year) %>%
summarise (actual_mean_rental = mean(bikes_hired))
updated_bike_2016_2019 <- bike %>%
filter (year >2015 & year< 2020) %>%
group_by (month) %>%
summarise (expected_rental = mean(bikes_hired)) #expected rentals per month
final_bike <- left_join(updated_bike, updated_bike_2016_2019) %>%
mutate (excess_rentals= actual_mean_rental - expected_rental,
up= ifelse(actual_mean_rental > expected_rental, actual_mean_rental,
expected_rental),
down= ifelse(actual_mean_rental < expected_rental, actual_mean_rental,
expected_rental))
final_bike## # A tibble: 68 × 7
## # Groups: month [12]
## month year actual_mean_rental expected_rental excess_rentals up down
## <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Jan 2016 18914. 20617. -1703. 20617. 18914.
## 2 Jan 2017 20596. 20617. -20.6 20617. 20596.
## 3 Jan 2018 20836. 20617. 218. 20836. 20617.
## 4 Jan 2019 22123. 20617. 1505. 22123. 20617.
## 5 Jan 2020 22893. 20617. 2276. 22893. 20617.
## 6 Jan 2021 13218. 20617. -7399. 20617. 13218.
## 7 Feb 2016 20608. 22049. -1441. 22049. 20608.
## 8 Feb 2017 22091. 22049. 42.1 22091. 22049.
## 9 Feb 2018 20587. 22049. -1462. 22049. 20587.
## 10 Feb 2019 24961. 22049. 2912. 24961. 22049.
## # … with 58 more rows
ggplot(final_bike, aes(x=month, y= actual_mean_rental)) +
geom_line(aes(group=1)) + #actual values
geom_line(aes(y= expected_rental, group=1), size=0.8, color= "blue") +
#expected values
facet_wrap(vars(year)) +
theme_bw()+
geom_ribbon(aes(ymin=expected_rental,ymax=up, group=1),fill="darkseagreen3",
alpha=0.4)+
geom_ribbon(aes(ymin=down,ymax=expected_rental, group=1),fill="indianred3",
alpha=0.4) +
labs(y = "Bike Rentals",
title= "Monthly changes in TFL bike rentals",
subtitle= "Change from monthly average shown in blue and calculated between 2016-2019")

bike_change <- bike %>%
filter((year >= 2016 & year <= 2020) | (year == 2021 & week <53)) %>%
group_by(week,year) %>%
mutate(bikes_each_week = sum(bikes_hired)) %>% #sum up no.of bikes in a week
group_by(week) %>%
mutate(average_each_week = median(bikes_each_week), #median bikes in same week
excess_rentals_percentage =
(bikes_each_week - average_each_week)/average_each_week)
print.data.frame(head(bike_change))

## day bikes_hired year month week bikes_each_week average_each_week
## 1 2016-01-01 9922 2016 Jan 53 22062 59380
## 2 2016-01-02 7246 2016 Jan 53 22062 59380
## 3 2016-01-03 4894 2016 Jan 53 22062 59380
## 4 2016-01-04 20644 2016 Jan 1 132071 126680
## 5 2016-01-05 22934 2016 Jan 1 132071 126680
## 6 2016-01-06 23199 2016 Jan 1 132071 126680
## excess_rentals_percentage
## 1 -0.6285
## 2 -0.6285
## 3 -0.6285
## 4 0.0426
## 5 0.0426
## 6 0.0426
ggplot(bike_change, aes(x = week, y = excess_rentals_percentage))+
scale_x_continuous(breaks = seq(0, 53, by = 13)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
coord_cartesian(ylim = c(-1,1)) +
geom_line(aes(y = excess_rentals_percentage)) +
geom_rect(aes(xmin=13, xmax=27, ymin=-2, ymax=2), color="grey99", alpha=0.1) +
geom_rect(aes(xmin=39, xmax=53, ymin=-2, ymax=2), color="grey99", alpha=0.1) +
geom_ribbon(aes(ymin=pmin(excess_rentals_percentage,0), ymax=0),
fill="indianred3", alpha=0.5) +
geom_ribbon(aes(ymin=0, ymax=pmax(excess_rentals_percentage,0)),
fill="darkseagreen3", alpha=0.5) +
geom_rug(data = subset(bike_change, excess_rentals_percentage < 0 ),
color="indianred3", sides="b") +
geom_rug(data = subset(bike_change, excess_rentals_percentage >= 0 ),
color="darkseagreen3", sides="b") +
facet_wrap(~year, nrow = 2) +
theme(legend.position="none") +
theme_minimal() +
labs (
title = "Weekly changes in TfL bike rentals",
subtitle = "% change from weekly averages calculated between 2016-2019",
x = "week",
y = "",
caption = "Source: TfL, London Data Store",
) +
NULL

For this challenge, we will be reproducing the following graph, but from the year 2000 onwards:
This graph has too many sub-categories, so, using the relative importance of components in the Consumer Price Indexes: U.S. city average, December 2020, we chose a smaller subset of the components and listed only the major categories (Housing, Transportation, Food and beverages, Medical care, Education and communication, Recreation, and Apparel).
url <- "https://fredaccount.stlouisfed.org/public/datalist/843"
library(rvest)
tables <- url %>% # Reading in data from site
read_html() %>%
html_nodes(css ="table")
cpi <- map(tables, . %>%
html_table(fill=TRUE) %>%
janitor::clean_names()) # Ensuring names are in a useable format
cpi_data <- cpi[[2]]
references <- as.vector(pull(cpi_data, series_id))
cpi_components <- references %>%
tidyquant::tq_get(get = "economic.data", from = "1999-01-01") %>%
mutate(cpi_yoy_change = price/lag(price, 12) - 1) %>% # Change calculation
na.omit() # Get rid of any NA values
# 1999 was chosen as the year from which to pull as the lag function reads from
# 12 months prior, and so pulling from 2000 means you can only get calculable
# results from 2001 January. The NA results for 1999 are then filtered out
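The effect of lag(price, 12) can be illustrated on toy data: the first twelve rows have no observation twelve months earlier, so their year-on-year change is NA, which is exactly why pulling from 1999 is needed to get values from 2000 onwards.

```r
library(dplyr)

# 24 months of a toy index growing 0.2% per month from January 1999
toy_cpi <- data.frame(
  date  = seq(as.Date("1999-01-01"), by = "month", length.out = 24),
  price = 100 * 1.002^(0:23)
)

toy_cpi <- toy_cpi %>%
  mutate(cpi_yoy_change = price / lag(price, 12) - 1)

# The first 12 rows are NA; from month 13 the YoY change is 1.002^12 - 1
sum(is.na(toy_cpi$cpi_yoy_change))
```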
cpi_components_major <- cpi_components %>%
filter((symbol == "CPIAUCSL") | (symbol == "CPIHOSSL") |
(symbol == "CPITRNSL") | (symbol == "CPIFABSL")
| (symbol == "CPIAPPSL")) %>%
dplyr::mutate(titles = case_when(symbol == "CPIAUCSL" ~ "All Items",
symbol == "CPIHOSSL" ~ "Housing",
symbol == "CPITRNSL" ~ "Transport",
symbol == "CPIFABSL" ~ "Food and Beverages",
symbol == "CPIAPPSL" ~ "Apparel"))
ggplot(cpi_components_major, aes(x = date, y=cpi_yoy_change))+
#scale_x_continuous(breaks = seq(0, 54, by = 13)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
coord_cartesian(ylim = c(-0.5, 0.5)) +
geom_point(size = 0.5, aes(color = ifelse(cpi_yoy_change < 0, "red4",
"steelblue4")),
show.legend = FALSE) +
scale_color_identity() +
geom_smooth(color = "gray") +
facet_wrap(~titles, nrow = 2) +
theme(legend.position = "none") +
theme_minimal() +
labs (
title = "Yearly Change of US CPI and its Major Components",
subtitle = "YOY Change Being Positive or Negative,
Jan 2000 to Aug 2021 ",
x = "",
y = "YoY % Change",
caption = "Data from FRED
https://fredaccount.stlouisfed.org/public/datalist/843",
) +
NULL
As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?
Yes